This report presents a structured MOFA2 analysis integrating transcriptomics, metagenomics, and metabolomics data. The goal is to identify latent factors capturing shared and view-specific sources of variation, and to generate a set of publication-ready figures that can be combined into a multi‑panel figure.
Main outcomes: - Latent factor structure across multi‑omics layers - Variance explained per factor and per view - Associations between factors and clinical / technical covariates - Feature loadings driving key factors
Analysis workflow (step‑by‑step)
Load required libraries and define plotting palettes
Multi‑omics matrices and sample metadata are loaded from pre‑processed RDS/CSV files. These data have already undergone quality control and normalisation upstream.
MOFA describes each omic dataset as a “view”. One of the possible inputs to mofa is a list of matrixes for each view. Each omics layer is formatted as a features × samples matrix. Sample ordering needs to be the same. The top 5000 most higly variable genes identified in the QC_Transcriptomics.qmd script is used to filter the transcriptomics data.
# remove long strain id, replace with intMG_matrix <- MG_matrix[["clr_data"]][["strain"]]new_names <-sub("\\s+\\S+$", "", colnames(MG_matrix)) # remove strain idnew_names <-make.unique(new_names, sep =" ") # Make names uniquecolnames(MG_matrix) <- new_names # Assign new names# select 5000 most higly variable genesTRX_HVG <- TRX_batch[top_hvg$gene[1:5000], ]views <-list(Transcriptomics =as.matrix(TRX_HVG),Metagenomics =unclass(t(MG_matrix)),Metabolomics =as.matrix(MB_matrix))
4.2 Convert matrices to long format
The matrix list input is only valid if you have matching samples for all “views”. If you wish to include samples that are not “complete” in this sense the long format version is another input alternative to create a MOFA model
With our samples selected we can now plot a overviw heatmap showing all our included samples, which omic views they are represented in and how they are distributed by some of our groups of intrest
Code
heatmap_prepp <-function(keep_s, vars, arrange_by ="Clinical_score_(0-5)") {### ------------------------------------------------------------### 1. Select symptom columns and prepare binary clinical data### ------------------------------------------------------------# Extract only samples of interest and symptom score columns# Convert TRUE/FALSE columns to numeric 0/1 and clean column names df_bin <- m %>%#filter(sample %in% m$sample[m$at_least_two & m$all_BL]) %>%filter(sample %in% m$sample[keep_s]) %>%left_join( .,select(meta, sample, RVVC_AS, RVVC_pos, "Clinical_score_(0-5)"),by ="sample" ) %>%mutate(RVVC_AS =factor(RVVC_AS, levels =c("RVVC", "Control", "RVVC_AS")) ) %>%mutate(RVVC_pos =factor( RVVC_pos,levels =c("RVVC_pos","RVVC_neg","Control_pos","Control_neg"#"NA_pos",#"NA_neg" ) ) ) %>%arrange(desc(n_views)) %>%arrange(RVVC_pos) %>%arrange(desc(`Clinical_score_(0-5)`)) %>%arrange(desc(.data[[arrange_by]]) ) %>%select(1:4) %>%mutate(across(where(is.logical), as.integer)) %>%column_to_rownames("sample") # convert sample ID to rownames# Convert to numeric matrix for ComplexHeatmap mat <-as.matrix(df_bin)# Determine row annotation height scaling factor my_h <<-0.1*nrow(mat)### ------------------------------------------------------------### 2. Prepare annotation dataframe (metadata)### ------------------------------------------------------------ annot_col <- meta %>%filter(sample %in%rownames(df_bin)) %>%# keep same samples as heatmap matrixarrange(match(sample, rownames(df_bin))) %>%select(sample, any_of(vars)) %>%# select annotation variablesmutate(across(all_of(vars), ~factor(.x))) %>%#mutate(across(1:length(vars) + 1, ~ factor(.x))) %>% # ensure annotations are factorscolumn_to_rownames("sample") # convert sample ID to rownames# Quick check: ensure annotation rows match heatmap rowsdim(mat)dim(annot_col)setdiff(rownames(mat), rownames(annot_col))### ------------------------------------------------------------### 4. Define row annotations for heatmap (ComplexHeatmap)### ------------------------------------------------------------ annot <-HeatmapAnnotation(df = annot_col[, c(vars)],which =c("row"),#which = c("column"),# Clin_gr = annot_col$Clin_gr, # binned clinical grade# "Clinical_score_(0-5)" = annot_col$`Clinical_score_(0-5)`, # raw clinical score# RVVC_AS = annot_col$RVVC_AS, # sample group# BV = annot_col$BV, # sample group# RVVC = annot_col$RVVC, # sample group# hyphae = annot_col$hyphae, # sample groupshow_annotation_name =TRUE, # show labelsannotation_name_gp =gpar(fontsize =8), # label font size# Legend formattingannotation_legend_param =list(grid_height =unit(.1, "mm"),grid_width =unit(2, "mm"),title ="",labels_gp =gpar(fontsize =7),title_gp =gpar(fontsize =8) ),simple_anno_size =unit(.3, "cm"), # annotation row height# Colors for each annotation variablena_col ="white",col = pal[vars] %>%imap(., ~set_names(.x, levels(annot_col[[.y]]))) )### ------------------------------------------------------------### Create heatmap### ------------------------------------------------------------ H <-Heatmap( mat,width =unit(ncol(mat) *5, "mm"),height =unit(nrow(mat) *2.2, "mm"),name ="Symptom Present",col = col_fun,# vertical orientation:right_annotation = annot,cluster_rows =FALSE,row_order =rownames(mat),row_names_side ="right",column_names_rot =90,row_names_gp =gpar(fontsize =7),heatmap_legend_param =list(at =c(0, 1),labels =c("No", "Yes") ),# cell_fun = function(j, i, x, y, width, height, fill) {# grid.rect(x, y, width, height, gp = gpar(col = "grey70", lwd = 0.5, fill = NA))# }layer_fun =function(j, i, x, y, width, height, fill) {# Draw cell borders without overwriting fillgrid.rect(x = x,y = y,width = width,height = height,gp =gpar(col ="grey70", lwd =0.4, fill =NA) ) } )#return(list(mat = mat, annot = annot, h = my_h))return(H)}
We plot unsupervised umaps coloured by different groups of intrest in order to see how they cluster across views.
!NB should think about doing with and without HVGs only
# filter samples from views for UMAP plotingm_ <- m %>%filter(sample %in% .$sample[m$at_least_two & m$all_BL])v <- views %>%imap(., ~ .x[, m_$sample[m_[[.y]]]])
This first code is taken from Paulos analysis of the metabolomics data. It gives more control over the knn clustering by perfoming it in a seperate step and then supplying the knn to the umap function. The other alternative bellow is just supplying the data to umap, which then runs the PCA and knn internaly with default setings.
#### group information ####annot_col <- meta %>%#filter(sample %in% .$sample[m$at_least_two & m$all_BL]) %>%# select(-`Age (years)`) %>%select(1:22,`Clinical_score_(0-5)` ) %>%mutate(across(2:ncol(.), ~factor(.x))) %>%mutate(txt_clin =ifelse(grepl("0", .$`Clinical_score_(0-5)`),NA,as.character(.$`Clinical_score_(0-5)`) ) ) %>%mutate(txt_id =ifelse(RVVC =="RVVC", .$sample, NA))#### UMAP analysis ####set.seed(1)umap_dfs <- v %>%map(., ~t(.x)) %>%map( .,~ uwot::umap( .x,#seed = 123,seed =463,n_neighbors =15,#metric = "cosine",init ="spectral",scale = T ) ) %>%map( .,~tibble("UMAP 1"= .x[, 1],"UMAP 2"= .x[, 2],"ID"=rownames(.x) ) ) %>%map(., ~left_join(., annot_col, by =c("ID"="sample")))meta_clus <-colnames(annot_col)color_vars <-c("pos","trx_louvain_7","metab_louvain","microb_louvain","integ_louvain","Clinical_score_(0-5)","RVVC_pos","RVVC_AS") %>%set_names(.)#### Combine and save plots ####make_umap_panel <-function(var, umap_dfs, outdir ="../Results/UMAP/") {# create plots for each dataset p_list <-imap( umap_dfs,~UMAP_fun(.x, var, "txt_clin", title = .y) )# combine with shared legend p_combined <- patchwork::wrap_plots(p_list) + patchwork::plot_layout(guides ="collect") &theme(legend.position ="right")# safe filename fname <-gsub("[^A-Za-z0-9_]", "_", var)# saveggsave(filename =paste0(outdir, "UMAP_", fname, ".png"),plot = p_combined,width =14,height =5,dpi =300 )return(p_combined)}plots <-map(color_vars, ~make_umap_panel(.x, umap_dfs))plots
$pos
$trx_louvain_7
$metab_louvain
$microb_louvain
$integ_louvain
$`Clinical_score_(0-5)`
$RVVC_pos
$RVVC_AS
Code
# use shapes:# UMAP_fun(umap_dfs$Metabolomics, "trx_louvain_7", "txt_clin", shape_gr = "pos", shape = c(21, 24))# no shapes# UMAP_fun(umap_dfs$Metabolomics, "trx_louvain_7", "BV")
6. MOFA model construction
A MOFA object is created and configured with default data and training options. The number of latent factors is set a priori and can be tuned depending on variance explained.
Data options
Important arguments
scale_groups
Scale groups to the same total variance? Default:FALSE
scale_views
Scale views to the same total variance? Default:FALSE
views
Names of the views to include in the model.
groups
Names of the sample groups.
Model options
Important arguments
num_factors
Number of latent factors.
likelihoods
Likelihood per view.
Options: "gaussian", "poisson", "bernoulli"
By default, likelihoods are inferred automatically.
spikeslab_factors
Use spike-and-slab sparsity prior on the factors? Default:FALSE
spikeslab_weights
Use spike-and-slab sparsity prior on the weights? Default:TRUE
ard_factors
Use Automatic Relevance Determination (ARD) prior on the factors? Default:TRUE when using multiple groups.
ard_weights
Use Automatic Relevance Determination (ARD) prior on the weights? Default:TRUE when using multiple views.
Training options
Important arguments
maxiter
Maximum number of training iterations. Default:1000
convergence_mode
Convergence speed of the algorithm.
Options: "fast", "medium" (default), "slow"
For exploratory analyses, "fast" is usually sufficient.
seed
Random seed for reproducibility.
We use default options, however it is recomended to run the model several times with different seeds. In one of the papers published by MOFAS authors, they ran the model ~24 times, before picking the model with the highest elbo score.
# inspect data distributionggdensity(mofa_df, x ="value", fill ="gray70") +facet_wrap(~view, nrow =3, scales ="free") # + xlim(0.5,)
We first assess the overall quality of the model by inspecting variance explained and factor correlations.
# inspect the model:plot_data_overview(MOFAmodel)
MOFAmodel
Trained MOFA with the following characteristics:
Number of views: 3
Views names: Metabolomics Metagenomics Transcriptomics
Number of features (per view): 90 244 5000
Number of groups: 1
Groups names: single_group
Number of samples (per group): 89
Number of factors: 15
A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit. Reasons could be that you used too many factors or perhaps the normalisation is not adequate.
Warning in .quality_control(object, verbose = verbose): Factor(s) 1 are strongly correlated with the total number of expressed features for at least one of your omics. Such factors appear when there are differences in the total 'levels' between your samples, *sometimes* because of poor normalisation in the preprocessing steps.
samples_metadata(MOFAbatch) <-samples_metadata(MOFAbatch) %>%left_join(meta, by ="sample")plot_factors( MOFAbatch,factors ="all",color_by ="batch")
When running MOFA the first time, the internal quality control warned that at least one of the factors was strongly corelated with the total number of expressed features in one of the views. This is typically a result of batch effect. As seen in the plot above Factor 1 perfectly seperates the two batches of the transcriptomics data. Thus batch correction was performe in the upstream QC_Transcriptomics.qmd script.
p <-plot_factors( MOFAmodel,factors ="all",color_by ="Clinical_score_(0-5)") +scale_fill_manual(values = pal$`Clinical_score_(0-5)`, na.value ="white")p
plot_data_heatmap has an interesting argument to “beautify” the heatmap: denoise = TRUE. Instead of plotting the (noisy) input data, we can plot the data reconstructed by the model, where noise has been removed:
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the MOFA2 package.
Please report the issue at <https://github.com/bioFAM/MOFA2>.